ABSTRACT

Optimal properties are derived and some new geometrical interpretations
given for principal components. Typically, our main results concern the
simultaneous minimization of eigenvalues of certain covariance matrices
which measure the goodness of an approximation. Many popular criteria
like total variance and generalized variance, which are increasing
functions of the eigenvalues, are then minimized by the best approximator.

In other situations, the criterion may not be a monotone function of
the eigenvalues. In Theorem 3.2, we derive a general optimal class based
on the non-negative definite ordering of covariance matrices.

Theorem 4.1 gives a result for the sequential selection of principal
components. In the final section, we give a new geometrical
interpretation of the sample principal components.

AMS (MOS) Subject Classification: 62H25

Key Words: Principal components, Statistical approximations

Work Unit Number 4 (Probability, Statistics, and Combinatorics)

SIGNIFICANCE AND EXPLANATION

When a large number of characteristics are measured on a large number
of population units, the sheer volume of data can cause problems. It is
natural, in such cases, to look for ways to reduce that data to a more
manageable form.

One way of doing this is the Principal Component Method, wherein the
original problem is replaced by an approximation of far lower dimension.
The variables in the approximate problem are certain linear combinations,
or principal components, of the variables in the original problem.

Of course, some choices of the exact linear combinations to be used
(approximators) will yield more meaningful results than others. Ideally, we
would like to retain as much information as possible, and we seek a set of
principal components which is optimal from this point of view.

In this paper, we derive some new properties of optimal principal
component approximators and obtain some further geometrical insights
concerning this method. We also help clarify a potential weakness in this
method by determining an even more general class of approximators, which
contains those selected by the principal component method, and by exhibiting
a situation where the approximator that would be selected by that method is
not optimal.

The responsibility for the wording and views expressed in this descriptive
summary lies with MRC, and not with the authors of this report.


SOME  OPTIMAL  PROPERTIES  AND  INTERPRETATIONS  OF  PRINCIPAL  COMPONENTS 
Raul  Hudlet  and  Richard  A.  Johnson 


1.  Review  of  Previous  Work 

Let $X = (X_1, \ldots, X_p)'$ be a random vector with zero expectation and covariance
matrix $\Sigma$, and let $\lambda_1 \ge \cdots \ge \lambda_p$ denote the eigenvalues of $\Sigma$ and
$P_1, \ldots, P_p$ a corresponding set of orthonormal eigenvectors. Notice that
$P_1, \ldots, P_p$ are uniquely determined (up to multiplication by $\pm 1$) only if
$\lambda_1 > \cdots > \lambda_p$.

Even though Pearson had encountered principal components as early as 1901, the
concept is generally attributed to Hotelling (1933), who was the first to introduce it in
a probabilistic framework. Since then Girshick (1936), Anderson (1958), Rao (1964),
Darroch (1965), and Okamoto and Kanazawa (1968), among others, have characterized
principal components by different sets of optimality properties.

Girshick (1936) showed that if the components of $X$ have variance one, then $P_1'X$,
a first principal component (P.C.), maximizes the sum of the squares of the correlations
between $\ell'X$ and each variate $X_i$ over all possible linear functions $\ell'X$.

Anderson (1958) established that among the class of linear functions $\ell'X$ with
$\ell'\ell = 1$, a first P.C., $P_1'X$, has maximum variance; a second P.C., $P_2'X$, has maximum
variance among the elements in the class uncorrelated with $P_1'X$; and so on.

Rao (1964) characterizes the first $k$ $(< p)$ principal components as a linear form
$Y = TX$, where $T$ is a $k \times p$ matrix, which minimizes the trace or the Euclidean norm
($\|M\|^2 = \sum_{i,j} m_{ij}^2$) of the covariance matrix of the residual of $X$ minus the best linear
predictor based on $Y$.

Darroch (1965) was the first to characterize principal components within the class
of all random variables with at most $k$ $(\le p)$ dimensions. In this formulation it is
desired to approximate the $p$-component vector $X$ by a linear form $AY$ of a $k \times 1$
random vector $Y$, where $A$ is a $p \times k$ matrix of constants. The error of approximation,
$F$, may be measured by the trace of the residual covariance matrix

\[ E(X - AY)(X - AY)'. \tag{1.1} \]

Darroch showed that $F$ is minimized with respect to $A$ and $Y$ when and only when

\[ AY = P_1(P_1'X) + \cdots + P_k(P_k'X), \tag{1.2} \]

which is a linear transformation of a set of $k$ first principal components. Then, the
minimum value is

\[ F = \lambda_{k+1} + \cdots + \lambda_p, \]

where $P_1, \ldots, P_k$ is any set of $k$ orthonormal eigenvectors of $\Sigma$ corresponding,
respectively, to the eigenvalues $\lambda_1, \ldots, \lambda_k$.

Okamoto and Kanazawa (1968) generalized Darroch's result by allowing $F$ to be any
function of the eigenvalues of the residual covariance matrix which is strictly increasing
in each of its $p$ arguments. Examples of such functions are the trace and the Euclidean
norm. They showed that $F$ is minimized when and only when $AY$ is as in (1.2), and
then $F = F(\lambda_{k+1}, \ldots, \lambda_p, 0, \ldots, 0)$. A nice review of these results appears in Okamoto (1969).
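The results of Darroch and of Okamoto and Kanazawa can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using a randomly generated covariance matrix (the matrix, the seed, and the helper name `pc_residual_covariance` are illustrative assumptions, not part of the original report). It forms the approximation (1.2) and compares the residual covariance matrix with the eigenvalues $\lambda_{k+1}, \ldots, \lambda_p$.

```python
import numpy as np

def pc_residual_covariance(sigma, k):
    """Eigenvalues of sigma (descending) and the residual covariance of X - AY,
    where AY = P_1 P_1'X + ... + P_k P_k'X as in (1.2)."""
    lam, vecs = np.linalg.eigh(sigma)          # eigh returns ascending order
    lam, vecs = lam[::-1], vecs[:, ::-1]       # reorder to descending
    P_k = vecs[:, :k]                          # first k orthonormal eigenvectors
    M = P_k @ P_k.T                            # AY = M X
    I = np.eye(sigma.shape[0])
    residual = (I - M) @ sigma @ (I - M).T     # E(X - MX)(X - MX)'
    return lam, residual

# Hypothetical 5 x 5 covariance matrix used only for illustration.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
sigma = G @ G.T

k = 2
lam, residual = pc_residual_covariance(sigma, k)
trace_gap = abs(np.trace(residual) - lam[k:].sum())
residual_eigs = np.sort(np.linalg.eigvalsh(residual))[::-1]
```

Under these assumptions, `trace_gap` should vanish up to rounding error, and `residual_eigs` should reproduce $(\lambda_{k+1}, \ldots, \lambda_p, 0, \ldots, 0)$, the argument of $F$ at its minimum.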

Our extensions concern the complete residual covariance matrix rather than just
increasing functions of the eigenvalues. Results of the latter kind require invariance
under orthogonal transformations and do not allow an investigator to single out special
components of the observation vector. After first establishing some preliminary results in
Section 2, we show in Section 3 that if the criterion for the fit is the non-negative definite
partial ordering on the residual covariance matrix, then in general no overall optimal $AY$
exists. However, Theorem 3.2 establishes that an optimal class does exist. The implications
of this extension are discussed in Section 3.3.

Principal components may also be introduced sequentially, one at a time, by setting
$k = 1$ above and demanding that at the $j$th stage, $A_jY_j$ be uncorrelated with those
selected at previous stages and that the residual covariance matrix
$E(X - A_jY_j)(X - A_jY_j)'$ have eigenvalues that are as small as possible. This is the
approach taken in Section 4, leading to Theorem 4.1.

Section 5 illustrates how the results of Darroch (1965) and Okamoto and Kanazawa
(1968) can be viewed as the population analogues of a generalization of Pearson's (1901)
approach. Both Darroch (1965) and Okamoto and Kanazawa seem to have been unaware of
this fact. We also obtain a result, Theorem 5.4, on the approximation of cross products
matrices.

Lastly, in Section 5.2 a seemingly new interpretation is given of the sample principal
components in $R^n$.

2.  A Preliminary  Formulation 

We are interested in approximating the vector $X$ by a random vector $AY$ when goodness
is measured by the residual covariance matrix (1.1). Different choices of $AY$ produce different
residual covariance matrices and, in order to compare them, we give the following definition.

A partial ordering in the class $\mathcal{A}$ of all non-negative definite matrices is defined
by the relation $\ge$, where

\[ A \ge B \quad \text{iff} \quad A - B \in \mathcal{A}. \tag{2.1} \]

The problem then becomes that of finding the $AY$ which makes (1.1) as small as possible.

We begin by noticing that there is no loss in assuming $EX = 0$. If this is not the
case and $EX = \mu$, say, then, similar to the regression situation where the best fitting
polynomial is not forced to pass through the origin, we consider the residual to be
defined by

\[ E(X - \eta - AY)(X - \eta - AY)', \tag{2.2} \]

where $\eta$ is a $p \times 1$ unknown constant vector that one is allowed to vary when searching
for a minimum.

If $EY \ne 0$, the term $A\,EY$ may be absorbed into $\eta$ by replacing $X - \eta - AY$
by $X - (\eta + A\,EY) - A(Y - EY)$. Since $\eta$ is arbitrary, $\eta + A\,EY$ is also arbitrary
and consequently we may restrict attention to $Y$ variables with zero expectation. Then

\[
\begin{aligned}
E(X - \eta - AY)(X - \eta - AY)' &= E\big(X - \mu - AY - (\eta - \mu)\big)\big(X - \mu - AY - (\eta - \mu)\big)' \\
&= E(X - \mu - AY)(X - \mu - AY)' + (\eta - \mu)(\eta - \mu)' \\
&\ge E(X - \mu - AY)(X - \mu - AY)'
\end{aligned}
\tag{2.3}
\]

with strict inequality unless $\eta = \mu$.

Thus, in (2.3), we must take $\eta = \mu$. By assuming $X$ has already been corrected
for its mean we get

\[ E(X - AY)(X - AY)', \tag{2.4} \]

which is just the matrix in the original definition (1.1). Consequently, we may assume
$EX = EY = 0$.

Also, without loss of generality, we may assume $E(YY') = I_k$, for even if the rank of
$E(YY')$ is less than $k$ we may replace $AY$ by $A^*Y^*$, where $Y^*$ is such that
$E(Y^*Y^{*\prime}) = I_k$ and $AY = A^*Y^*$ (a.s.).

Finally, notice that if $r$ = rank of $\Sigma$ and $r \le k$, the problem is trivial since
we obtain a perfect fit with $\tilde{A} = [P_1, \ldots, P_r, 0, \ldots, 0]$ and
$\tilde{Y}' = (P_1'X, \ldots, P_r'X, \ldots, P_k'X)$, so $\tilde{A}\tilde{Y} = X$ and

\[ 0 = E(X - \tilde{A}\tilde{Y})(X - \tilde{A}\tilde{Y})' \le E(X - AY)(X - AY)' \tag{2.5} \]

with strict inequality for any choice of $AY$ unless $AY = \tilde{A}\tilde{Y}$ (a.s.).

Thus in the rest of this paper, unless otherwise stated, we will assume that

\[ EX = EY = 0, \qquad EYY' = I_k, \qquad k < r. \tag{2.6} \]


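Lemma 2.1. Under assumptions (2.6) let

\[ B = \mathrm{cov}(X, Y). \tag{2.7} \]

Then, with strict inequality unless $B = A$,

\[ E(X - AY)(X - AY)' \ge \Sigma - BB' \ge 0. \tag{2.8} \]

Proof. Since $E(X - BY)Y' = B - B = 0$, the cross terms vanish and

\[
\begin{aligned}
E(X - AY)(X - AY)' &= E\big(X - BY + (B - A)Y\big)\big(X - BY + (B - A)Y\big)' \\
&= E(X - BY)(X - BY)' + (B - A)(B - A)' \\
&\ge E(X - BY)(X - BY)' = \Sigma - BB'. \qquad \Box
\end{aligned}
\]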
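Lemma 2.1 can be illustrated numerically. The sketch below assumes NumPy and, purely for convenience, takes $Y = H'X$ to be a linear function of $X$ scaled so that $E(YY') = I_k$ (the lemma itself does not require $Y$ to be a function of $X$). The matrices `sigma`, `H`, and `A` are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 2
G = rng.standard_normal((p, p))
sigma = G @ G.T                                # covariance matrix of X

# Hypothetical Y = H'X, scaled so that E(YY') = H' sigma H = I_k.
H0 = rng.standard_normal((p, k))
L = np.linalg.cholesky(H0.T @ sigma @ H0)
H = H0 @ np.linalg.inv(L).T
B = sigma @ H                                  # B = cov(X, Y), as in (2.7)

def residual_cov(A):
    """E(X - AY)(X - AY)' = sigma - AB' - BA' + AA' when E(YY') = I_k."""
    return sigma - A @ B.T - B @ A.T + A @ A.T

A = rng.standard_normal((p, k))                # an arbitrary competing coefficient matrix
gap = residual_cov(A) - (sigma - B @ B.T)      # should equal (A - B)(A - B)'
min_gap_eig = np.linalg.eigvalsh(gap).min()
min_floor_eig = np.linalg.eigvalsh(sigma - B @ B.T).min()
```

Both `min_gap_eig` and `min_floor_eig` should be non-negative up to rounding error, matching the two inequalities in (2.8).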


3.  A Class  of  Best  Approximators 

The best current result on the approximation of $X$ by $BY$ is due to Okamoto and
Kanazawa (1968). However, they restrict themselves to functions of the eigenvalues of the
residual matrix $E(X - BY)(X - BY)'$ which are increasing in each argument.

Theorem 3.1 [Okamoto and Kanazawa]. Let $\Sigma$ have eigenvectors $P_1, \ldots, P_p$ corresponding
to the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Under assumptions (2.6), any strictly
increasing function of the eigenvalues of

\[ E(X - AY)(X - AY)' \]

is minimized with respect to $A$ and $Y$ by the choice

\[ AY = P_1P_1'X + \cdots + P_kP_k'X. \]

That is, the eigenvalues of the residual matrix are simultaneously minimized. □

Remark. One can conclude from Theorem 3.1 that the sum of residual variances (trace) is
minimized, the generalized variance (determinant) is minimized, and the sum of squares of all
entries (sum of squared eigenvalues) is minimized.

We show below that dropping the invariance condition, implied by the restriction
to functions of eigenvalues, leads to a whole class of optimal solutions derived from
the partial ordering of non-negative definiteness.

3.1.  Best  Approximators 

Lemma 2.1 tells us that, in trying to minimize (1.1) under assumptions (2.6), we
can only improve the approximation if the matrix $A$ of coefficients of $Y$ is chosen as
the covariance between $X$ and $Y$. Now $B$ must be found such that $\Sigma - BB'$ is as
"small" as possible, and then a $Y$ that produces such a $B$ must be found. The next theorem
gives the structure of the non-negative definite covariance matrix $\Sigma - BB'$. It will
be seen that there is no $B$ which uniformly minimizes (2.8) but rather a collection of
$B$'s which are optimal in a sense.

Theorem 3.2. Let $\Sigma \ge 0$ be of order $p$ and rank $r$ and have eigenvectors
$P = (P_1, \ldots, P_p)$ and eigenvalues $\lambda_1 \ge \cdots \ge \lambda_r > 0$. Then under assumptions (2.6), set
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ and

\[
\Gamma_k = \left\{ B^*Y^* \;\middle|\; B^*Y^* = P
\begin{pmatrix}
\Lambda^{1/2} R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{-1/2} & 0 \\
0 & 0
\end{pmatrix} P'X;\ \ R\ \ r \times r \text{ orthogonal} \right\}
\]

so that by changing $R$ the different elements of $\Gamma_k$ are obtained. Then

(1) [Completeness of $\Gamma_k$]. Given any random vector $AY$, there exists an element $B^*Y^*$
in the class $\Gamma_k$ such that

\[ E(X - B^*Y^*)(X - B^*Y^*)' \le E(X - AY)(X - AY)' \]

with strict inequality unless (a.s.) $AY = B^*Y^*$.

(2) [Minimality of $\Gamma_k$]. Let $B^*Y^*$, in $\Gamma_k$, be given. There is no $\tilde{B}\tilde{Y}$ in $\Gamma_k$ (with
$\tilde{B}\tilde{Y}$ not a.s. equal to $B^*Y^*$) such that

\[ E(X - \tilde{B}\tilde{Y})(X - \tilde{B}\tilde{Y})' \le E(X - B^*Y^*)(X - B^*Y^*)'. \qquad \Box \]
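The structure of the class in Theorem 3.2 can be explored numerically. The following sketch assumes NumPy and a full-rank covariance matrix, so that $r = p$; the function names and the random inputs are illustrative assumptions. It builds the coefficient matrix of two members of the class, one with $R = I$ (which recovers the principal component approximator) and one with a randomly chosen orthogonal $R$, and compares their residual covariance matrices.

```python
import numpy as np

def gamma_k_member(sigma, R, k):
    """Coefficient matrix M such that B*Y* = M X, for full-rank sigma and orthogonal R."""
    lam, P = np.linalg.eigh(sigma)
    lam, P = lam[::-1], P[:, ::-1]                 # descending eigenvalues
    J = np.zeros((len(lam), len(lam)))
    J[:k, :k] = np.eye(k)                          # the block matrix [[I_k, 0], [0, 0]]
    root = np.diag(np.sqrt(lam))                   # Lambda^{1/2}
    inv_root = np.diag(1.0 / np.sqrt(lam))         # Lambda^{-1/2}
    return P @ root @ R @ J @ R.T @ inv_root @ P.T

def residual_cov(sigma, M):
    """E(X - MX)(X - MX)'."""
    I = np.eye(sigma.shape[0])
    return (I - M) @ sigma @ (I - M).T

rng = np.random.default_rng(2)
p, k = 4, 2
G = rng.standard_normal((p, p))
sigma = G @ G.T + np.eye(p)                        # hypothetical full-rank covariance

R_pc = np.eye(p)                                   # R = I gives principal components
R_other, _ = np.linalg.qr(rng.standard_normal((p, p)))   # another orthogonal R

res_pc = residual_cov(sigma, gamma_k_member(sigma, R_pc, k))
res_other = residual_cov(sigma, gamma_k_member(sigma, R_other, k))

pc_eigs = np.sort(np.linalg.eigvalsh(res_pc))
other_eigs = np.sort(np.linalg.eigvalsh(res_other))
diff_eigs = np.linalg.eigvalsh(res_other - res_pc)
```

In this sketch the sorted eigenvalues in `pc_eigs` should not exceed those in `other_eigs`, in line with Theorem 3.1, while `diff_eigs` is expected to contain both negative and positive values: neither residual matrix dominates the other in the non-negative definite ordering, which is what part (2) asserts.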

3.2. Proof of Theorem 3.2

We now proceed to develop a proof of Theorem 3.2 through a series of results.

Theorem 3.3. Let $\Sigma$ of order $p$ and rank $r$ be n.n.d. and $B$ be $p \times k$ such that
$\Sigma - BB' \ge 0$. Then there exists an orthogonal matrix $R$ of order $r$ such that

\[
BB' = P
\begin{pmatrix}
\Lambda^{1/2} R \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\
0 & 0
\end{pmatrix} P',
\]

where $D = \mathrm{diag}(d_1, \ldots, d_k)$; $1 \ge d_1 \ge \cdots \ge d_k \ge 0$; $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$; and $P$ is
orthogonal of order $p$ and its $i$th column $P_i$ is an eigenvector of $\Sigma$ corresponding
to the $i$th largest eigenvalue of $\Sigma$, namely $\lambda_i$ $(i = 1, \ldots, p)$.

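Proof. By the spectral decomposition theorem, an orthogonal matrix $P$ exists such that

\[ \Sigma = P \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix} P', \]

where $\Lambda$ is an $r \times r$ diagonal matrix with diagonal entries $\lambda_1 \ge \cdots \ge \lambda_r > 0$. Then

\[
\Sigma - BB' = P \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix} P' - BB' \ge 0
\iff
\begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix} - P'BB'P \ge 0. \tag{3.1}
\]

From (3.1) it is clear that the lower right hand corner of $P'BB'P$ must be zero, so that

\[ P'B = \begin{pmatrix} C \\ 0 \end{pmatrix} \tag{3.2} \]

for some $r \times k$ matrix $C$. With (3.2), the inequality (3.1) then takes the form

\[
\begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix} - \begin{pmatrix} CC' & 0 \\ 0 & 0 \end{pmatrix}
= \begin{pmatrix} \Lambda - CC' & 0 \\ 0 & 0 \end{pmatrix} \ge 0 \tag{3.3}
\]

and we may thus restrict attention to the upper left corners:

\[ \Lambda - CC' \ge 0 \iff I_r - \Lambda^{-1/2} CC' \Lambda^{-1/2} \ge 0. \tag{3.4} \]

Again by the spectral theorem an orthogonal matrix $R$ exists such that

\[ R' \Lambda^{-1/2} CC' \Lambda^{-1/2} R = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}, \tag{3.5} \]

where $D = \mathrm{diag}(d_1, \ldots, d_k)$; $d_1 \ge \cdots \ge d_k \ge 0$. Substituting in (3.4),

\[ I_r - \Lambda^{-1/2} CC' \Lambda^{-1/2} \ge 0 \iff I_r - \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} \ge 0. \tag{3.6} \]

From (3.2),

\[ BB' = P \begin{pmatrix} CC' & 0 \\ 0 & 0 \end{pmatrix} P'. \tag{3.7} \]

Equation (3.5) gives

\[ CC' = \Lambda^{1/2} R \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2}, \]

which in (3.7) yields

\[
BB' = P
\begin{pmatrix}
\Lambda^{1/2} R \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\
0 & 0
\end{pmatrix} P'. \tag{3.8}
\]

Finally, (3.6) shows $1 \ge d_1 \ge \cdots \ge d_k \ge 0$. □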

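Theorem 3.4 shows that the choice $D = I_k$ is best.

Theorem 3.4. Under the conditions of Theorem 3.3, let

\[
B^*B^{*\prime} = P
\begin{pmatrix}
\Lambda^{1/2} R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\
0 & 0
\end{pmatrix} P'. \tag{3.9}
\]

Then

(i) \[ 0 \le \Sigma - B^*B^{*\prime} \le \Sigma - BB' \tag{3.10} \]
with strict inequality unless $B^*B^{*\prime} = BB'$.

(ii) (Admissibility of $B^*$). If $\tilde{B}$ is an arbitrary $p \times k$ matrix such that

\[ 0 \le \Sigma - \tilde{B}\tilde{B}' \le \Sigma - B^*B^{*\prime} \tag{3.11} \]

then

\[ \tilde{B}\tilde{B}' = B^*B^{*\prime}. \tag{3.12} \]

Proof. $(\Sigma - BB') - (\Sigma - B^*B^{*\prime}) = B^*B^{*\prime} - BB'$.
By Theorem 3.3 and (3.9), this can be written as

\[
\begin{aligned}
& P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\ 0 & 0 \end{pmatrix} P'
- P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\ 0 & 0 \end{pmatrix} P' \\
&\qquad = P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} I_k - D & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{1/2} & 0 \\ 0 & 0 \end{pmatrix} P' \ge 0
\end{aligned}
\tag{3.13}
\]

since, by (3.6), $1 \ge d_1 \ge \cdots \ge d_k \ge 0$. Notice the inequality is strict unless $D = I_k$,
in which case $BB' = B^*B^{*\prime}$.

Suppose $\tilde{B}$ exists such that (3.11) holds. Then, proceeding as in Theorem 3.3,
$\tilde{R}$ orthogonal and $\tilde{D}$ diagonal exist (the notation being clear) such that

\[
\tilde{B}\tilde{B}' = P
\begin{pmatrix}
\Lambda^{1/2} \tilde{R} \begin{pmatrix} \tilde{D} & 0 \\ 0 & 0 \end{pmatrix} \tilde{R}' \Lambda^{1/2} & 0 \\
0 & 0
\end{pmatrix} P'. \tag{3.14}
\]

From (i) of the present theorem,

\[
\tilde{B}^*\tilde{B}^{*\prime} = P
\begin{pmatrix}
\Lambda^{1/2} \tilde{R} \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \tilde{R}' \Lambda^{1/2} & 0 \\
0 & 0
\end{pmatrix} P' \tag{3.15}
\]

is such that

\[ 0 \le \Sigma - \tilde{B}^*\tilde{B}^{*\prime} \le \Sigma - \tilde{B}\tilde{B}' \le \Sigma - B^*B^{*\prime}. \tag{3.16} \]

Equations (3.9) and (3.15) show that (3.16) is equivalent to

\[
\tilde{R} \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \tilde{R}' \ge R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R'. \tag{3.17}
\]

It is easy to see that (3.17) can hold only if the equality holds, for it is clear
that the first column of $R$, say $r_1$, is an eigenvector of the right hand side of
(3.17) corresponding to the eigenvalue 1. Since the eigenvalues are 1's and 0's,
by the Courant-Fischer minimax theorem,

\[
1 = r_1' R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R' r_1
\le r_1' \tilde{R} \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \tilde{R}' r_1 \le 1.
\]

For equality, $r_1$ must be an eigenvector of the left hand side of (3.17) corresponding
to the eigenvalue 1. Similarly it can be shown that the other columns of $R$ are
eigenvectors of both sides corresponding to the same eigenvalues, and thus the two sides
have the same eigenvalues and a common set of orthonormal eigenvectors, which implies

\[
\tilde{R} \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} \tilde{R}' = R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R'. \tag{3.18}
\]

(Notice, however, that this does not imply $\tilde{R} = R$.) □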

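The next lemma gives us the general form of $B^*$.

Lemma 3.5. If $B^*B^{*\prime}$ is as in (3.9) and $B^*$ is $p \times k$, then $B^*$ is of the form

\[
B^* = P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} Q \\ 0 \end{pmatrix} \\ 0 \end{pmatrix}, \tag{3.19}
\]

where $Q$ is an arbitrary orthogonal matrix of order $k$.

\[
H'B^* = \big( (Q' : 0) R' \Lambda^{-1/2},\ 0 \big) P'P
\begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} Q \\ 0 \end{pmatrix} \\ 0 \end{pmatrix} = I_k, \tag{3.24}
\]

the conditions on the moments follow.

If $\tilde{Y}$ also satisfies (3.23), then

\[ E(\tilde{Y} - H'X)(\tilde{Y} - H'X)' = I_k - E(\tilde{Y}X'H) - E(H'X\tilde{Y}') + I_k = 0, \tag{3.25} \]

so $\tilde{Y} = H'X = Y^*$ (a.s.).

Remark. It must be noticed that even though (3.21) shows that $B^*$ is arbitrary up to
a choice of $Q$, so that $Y^*$ in Lemma 3.6 may change, the product $B^*Y^*$ is invariant:

\[
B^*Y^* = P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} Q \\ 0 \end{pmatrix} \\ 0 \end{pmatrix}
\big( (Q' : 0) R' \Lambda^{-1/2},\ 0 \big) P'X
= P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} R' \Lambda^{-1/2} & 0 \\ 0 & 0 \end{pmatrix} P'X, \tag{3.26}
\]

which does not depend on $Q$. We are now ready to prove our main result.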

Proof (Theorem 3.2). (1) Let $AY$ be given. With $B = \mathrm{cov}(X, Y)$, Lemma 2.1 establishes
that

\[ E(X - BY)(X - BY)' = \Sigma - BB' \le E(X - AY)(X - AY)' \tag{3.27} \]

with strict inequality unless $B = A$.

Since $\Sigma - BB' \ge 0$ from (2.8), Theorem 3.4 gives $B^*$ such that

\[ 0 \le \Sigma - B^*B^{*\prime} \le \Sigma - BB' \tag{3.28} \]

with strict inequality unless $B^*B^{*\prime} = BB'$. Next, let $Y^*$ be as in Lemma 3.6, so

\[ E(X - B^*Y^*)(X - B^*Y^*)' = \Sigma - B^*B^{*\prime} \le E(X - AY)(X - AY)' \tag{3.29} \]

with strict inequality unless $B^*B^{*\prime} = BB'$ and $B = A$. Suppose that equality holds.
Then $B^*B^{*\prime} = BB'$ and, by Lemma 3.5 applied to $BB'$, there exists an orthogonal $Q$,
of order $k$, such that

\[ B = P \begin{pmatrix} \Lambda^{1/2} R \begin{pmatrix} Q \\ 0 \end{pmatrix} \\ 0 \end{pmatrix}. \tag{3.30} \]

Lemma 3.6 then implies that (a.s.)

\[ Y = \big( (Q' : 0) R' \Lambda^{-1/2},\ 0 \big) P'X \tag{3.31} \]

and, by the remark after Lemma 3.6, $B^*Y^* = BY$ (a.s.).

For the proof of (2), we construct $\tilde{B}^*\tilde{Y}^*$, just as $B^*Y^*$ was constructed in part (1),
such that

\[ E(X - \tilde{B}^*\tilde{Y}^*)(X - \tilde{B}^*\tilde{Y}^*)' = \Sigma - \tilde{B}^*\tilde{B}^{*\prime} \le E(X - B^*Y^*)(X - B^*Y^*)'. \tag{3.32} \]

That is, $\Sigma - \tilde{B}^*\tilde{B}^{*\prime} \le \Sigma - B^*B^{*\prime}$, but then part (ii) of Theorem 3.4 implies that
$\tilde{B}^*\tilde{B}^{*\prime} = B^*B^{*\prime}$. As in part (1), it then follows that $\tilde{B}^*\tilde{Y}^* = B^*Y^*$ (a.s.).
3.3.  Some  Implications  of  Theorem  3.2 

When predicting $X$ by $BY$ with $Y$ $k \times 1$, if loss is measured by a function of
the eigenvalues of the residual covariance matrix, then Okamoto and Kanazawa (1968) give
conditions under which principal components are optimal. Other objectives, which are
not expressible as functions of the eigenvalues, lead to the selection of other members
of the class $\Gamma_k$.

To illustrate this point, we consider the case $k = 1$. Suppose that, for a given
vector $a$, it is desired to find $BY$ such that

\[ a'E(X - BY)(X - BY)'a \tag{3.33} \]

is a minimum. This monotone, non-decreasing loss function over the class of non-negative
definite matrices is not a function of the eigenvalues. Here

\[
a'E(X - BY)(X - BY)'a = a'(\Sigma - BB')a = \mathrm{Var}(a'X) - \mathrm{Cov}^2(a'X, Y) \ge \mathrm{Var}(a'X) - \mathrm{Var}(a'X)\mathrm{Var}(Y) = 0
\]

with equality if and only if $Y \propto a'X$; that is, if $Y = (a'\Sigma a)^{-1/2} a'X$. Then

\[ B = E\big(XX'a/(a'\Sigma a)^{1/2}\big) = (a'\Sigma a)^{-1/2}\,\Sigma a, \]

\[ BY = (a'\Sigma a)^{-1}(a'X)\,\Sigma a. \tag{3.34} \]

With this choice of $BY$, the residual covariance matrix is

\[ \Sigma - BB' = \Sigma - (a'\Sigma a)^{-1}\,\Sigma a a' \Sigma \tag{3.35} \]

and the loss in (3.33) is zero.

Now let $P\Lambda P'$ be the spectral decomposition of $\Sigma$ (we assume $|\Sigma| \ne 0$) with $P$
orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$; $\lambda_1 \ge \cdots \ge \lambda_p$. We may write $a = cP\Lambda^{-1/2}r$, where
$c$ is a constant and $r \in R^p$, $r'r = 1$. Then $BY$ may be expressed as

\[
BY = (a'\Sigma a)^{-1}(a'X)\,\Sigma a = P\Lambda^{1/2} r r' \Lambda^{-1/2} P'X
= P\Lambda^{1/2} R \begin{pmatrix} 1 & 0' \\ 0 & 0 \end{pmatrix} R' \Lambda^{-1/2} P'X, \tag{3.36}
\]

where $R = (r, r_2, \ldots, r_p)$ is orthogonal. Consequently $BY$ is in the class $\Gamma_1$ defined
in Theorem 3.2. That is, over all possible predictors, the loss (3.33) is minimized
when $BY$ is as in (3.34), which is a member of $\Gamma_1$.
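A small numerical sketch may make the example above concrete. It assumes NumPy, a randomly generated full-rank covariance matrix, and an arbitrary weight vector `a`; all of these are illustrative assumptions. The code evaluates the loss (3.33) for the predictor (3.34) and for the first principal component predictor, and checks the representation (3.36).

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
G = rng.standard_normal((p, p))
sigma = G @ G.T + np.eye(p)                    # hypothetical full-rank covariance
a = np.array([1.0, 2.0, 0.0, -1.0])            # hypothetical weight vector

s = a @ sigma @ a                              # a' Sigma a
M = np.outer(sigma @ a, a) / s                 # BY = M X, as in (3.34)
residual = sigma - np.outer(sigma @ a, sigma @ a) / s   # residual covariance (3.35)
loss_a = a @ residual @ a                      # loss (3.33) for this predictor

# First principal component predictor P_1 P_1'X, for comparison.
lam, P = np.linalg.eigh(sigma)                 # ascending eigenvalues
p1 = P[:, -1]                                  # eigenvector of the largest eigenvalue
pc_residual = sigma - lam[-1] * np.outer(p1, p1)
loss_pc = a @ pc_residual @ a

# Representation (3.36): r is proportional to Lambda^{1/2} P'a, normalized to unit length.
r = np.sqrt(lam) * (P.T @ a)
r /= np.linalg.norm(r)
M_gamma = P @ np.diag(np.sqrt(lam)) @ np.outer(r, r) @ np.diag(1.0 / np.sqrt(lam)) @ P.T
```

Here `loss_a` should be zero up to rounding while `loss_pc` is generally positive; conversely, the trace of `pc_residual` cannot exceed the trace of `residual`. The two predictors are optimal for different criteria, and `M` should agree with `M_gamma`.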

As a further specialization, suppose our main concern is variable one. To reflect
this, we may take $a' = (1, 0, \ldots, 0)$ so that (3.33) equals entry (1,1) of the residual
covariance matrix. In this situation, we may simply take $(X_1, 0, \ldots, 0)'$ as our predictor
and have $X - (X_1, 0, \ldots, 0)' = (0, X_2, \ldots, X_p)'$, which has residual covariance matrix with
entry (1,1) equal to zero. However, the predictor $(X_1, 0, \ldots, 0)'$ does nothing to predict
the remaining variables and, conceivably, one can find a predictor that is as good at
predicting $X_1$ and yet gives better prediction of the other variables. Any other
predictor will give a residual covariance matrix with entry (1,1) greater than zero,
unless the predictor's first entry is (a.s.) $X_1$. We are then limited to predictors
of the form $AX_1$ with $A$ a $p \times 1$ vector having 1 as the first entry. From (3.34),
we obtain the predictor

\[ \sigma_{11}^{-1}\,\sigma_1 X_1, \tag{3.37} \]

where $\sigma_1$ is the first column of $\Sigma$, and $\sigma_{11}$ its (1,1) entry. This predictor
also gives a residual covariance matrix with entry (1,1) equal to zero.

We conclude our discussion by showing that, among all predictors $AX_1$ with first
entry $X_1$, (3.37) gives a residual covariance matrix that is smaller than
or equal to the residual covariance matrix given by any other
predictor of the prescribed form.

To this end let a predictor $AX_1$ be written as

\[
X = \begin{pmatrix} X_1 \\ X_{(2)} \end{pmatrix}, \qquad
AX_1 = \begin{pmatrix} 1 \\ c \end{pmatrix} X_1, \qquad
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{21}' \\ \sigma_{21} & \Sigma_{22} \end{pmatrix};
\]

then $X - AX_1$ has residual covariance matrix

\[
\begin{pmatrix} 0 & 0' \\ 0 & \Sigma_{22} - c\sigma_{21}' - \sigma_{21}c' + \sigma_{11}cc' \end{pmatrix},
\]

but

\[
\Sigma_{22} - c\sigma_{21}' - \sigma_{21}c' + \sigma_{11}cc'
= \sigma_{11}\big(c - \sigma_{11}^{-1}\sigma_{21}\big)\big(c - \sigma_{11}^{-1}\sigma_{21}\big)' + \Sigma_{22} - \sigma_{11}^{-1}\sigma_{21}\sigma_{21}',
\]

and the first term on the right is non-negative definite and vanishes only when
$c = \sigma_{11}^{-1}\sigma_{21}$, which is the choice made in (3.37).
4.  Introducing  the  Principal  Components  Sequentially 

Here we extend Theorem 3.1 to the sequential selection of approximators. Suppose
$k = 1$, so that the covariance matrices are of the form $E(X - AY)(X - AY)'$, where $A$
is a $p \times 1$ vector of constants and $Y$ a univariate random variable. In this section,
we sometimes write $\lambda[M]$ to denote that $\lambda$ is an eigenvalue of $M$.

Let $P_1$ be an eigenvector of $\Sigma$ corresponding to its largest eigenvalue, and
$A_1Y_1 = P_1P_1'X$. Theorem 3.1 establishes that this $AY$ has the property that the $i$th
largest eigenvalue of the corresponding residual covariance matrix
$E(X - P_1P_1'X)(X - P_1P_1'X)'$ is less than or equal to the corresponding eigenvalue of
any possible residual covariance matrix $(i = 1, \ldots, p)$.

Proceeding in a sequential manner, we seek $A_2Y_2$ uncorrelated with $A_1Y_1$ with
the same minimizing property among the class of $AY$'s uncorrelated with $A_1Y_1$. Then
$A_3Y_3$ uncorrelated with $A_1Y_1$ and $A_2Y_2$ with the minimizing property is desired,
and so on. The next theorem tells us that principal components are the solution.

Theorem 4.1. Let $X$ be a $p \times 1$ random vector with zero expectation and covariance
matrix $\Sigma$ of rank $r$. Also let

\[ \mathcal{P} = \{ AY \mid A \text{ is a } p \times 1 \text{ vector and } Y \text{ a random variable with } EY = 0,\ EY^2 = 1 \}. \]

For $j = 1, \ldots, r$, let $A_jY_j \in \mathcal{P}$. Then,

\[ \lambda_i\big[E(X - A_jY_j)(X - A_jY_j)'\big] \le \lambda_i\big[E(X - AY)(X - AY)'\big] \qquad (i = 1, \ldots, p), \]

for every $AY$ in $\mathcal{P}$ uncorrelated with $A_1Y_1, \ldots, A_{j-1}Y_{j-1}$, if and only if $A_jY_j = P_jP_j'X$,
where $P_j$ is an eigenvector of $\Sigma$ corresponding to the eigenvalue $\lambda_j[\Sigma]$.

Proof. For $j = 1$, application of Theorem 3.1 provides $A_1Y_1 = P_1P_1'X$, where $P_1$ is
as stated. For $j = 2$, let $P = (P_1 : C) = (P_1, P_2, \ldots, P_p)$ be orthogonal, where
$P_2, \ldots, P_p$ are eigenvectors of $\Sigma$ corresponding respectively to $\lambda_2[\Sigma], \ldots, \lambda_p[\Sigma]$, so that
$X = PP'X = P_1P_1'X + CC'X$ and

\[
\begin{aligned}
E(X - A_2Y_2)(X - A_2Y_2)' &= E(P_1P_1'X + CC'X - A_2Y_2)(P_1P_1'X + CC'X - A_2Y_2)' \\
&= E(P_1P_1'X)(P_1P_1'X)' + E(CC'X - A_2Y_2)(CC'X - A_2Y_2)' \\
&= \lambda_1 P_1P_1' + E(CC'X - A_2Y_2)(CC'X - A_2Y_2)',
\end{aligned}
\]

where the cross terms are zero because $P_1P_1'X$ is uncorrelated with $CC'X$ and $A_2Y_2$.
The result now follows from Theorem 3.1 by noting that $CC'X$ has covariance matrix

\[
CC'\Sigma CC' = CC'(P_1 : C)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_p)\,(P_1 : C)'CC'
= \sum_{i=2}^{p} \lambda_i P_iP_i' = C\,\mathrm{diag}(\lambda_2, \ldots, \lambda_p)\,C'.
\]

Alternatively, let $\tilde{A}_2 = E(CC'XY_2) = CC'E(XY_2) = CC'u$, say. From Lemma 2.1, we have

\[
\lambda_1 P_1P_1' + E(CC'X - A_2Y_2)(CC'X - A_2Y_2)' \ge \lambda_1 P_1P_1' + CC'\Sigma CC' - CC'uu'CC'
= \lambda_1 P_1P_1' + CC'(\Sigma - uu')CC'. \tag{4.1}
\]

Since $C'P_1 = 0$, by simply multiplying the right hand side of (4.1) on the right by $P_1$
we see that $P_1$ is an eigenvector corresponding to the eigenvalue $\lambda_1$. Similarly, any
eigenvector of the second term, perpendicular to $P_1$, is also an eigenvector of the whole
right hand side of (4.1) for the same eigenvalue. Thus to minimize the eigenvalues of (4.1)
we need only minimize the eigenvalues of the second term.

But

\[ CC'(\Sigma - uu')CC' = E(CC'X - \tilde{A}_2Y_2)(CC'X - \tilde{A}_2Y_2)' \]

and $CC'X$ is a vector with zero expectation and covariance matrix $CC'\Sigma CC'$. From the
result for $j = 1$, we must take $\tilde{A}_2Y_2 = U_2U_2'X$, where $U_2$ is an eigenvector of $CC'\Sigma CC'$
corresponding to its largest eigenvalue. Thus $A_2Y_2 = P_2P_2'X$ gives the result for $j = 2$.
Similarly for $j = 3, \ldots, r$. □
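The reduction used in the proof, where each stage works with the part of $\Sigma$ left after removing the components already selected, can be sketched as a deflation loop. The code below assumes NumPy and a randomly generated covariance matrix with distinct eigenvalues; the function name `sequential_components` is an illustrative assumption.

```python
import numpy as np

def sequential_components(sigma, steps):
    """Select leading eigenvectors one at a time from successive residual covariances."""
    residual = sigma.copy()
    values, vectors = [], []
    for _ in range(steps):
        lam, vecs = np.linalg.eigh(residual)       # ascending eigenvalues
        top_value, top_vector = lam[-1], vecs[:, -1]
        values.append(top_value)
        vectors.append(top_vector)
        # Covariance left after removing the component top_vector top_vector' X.
        residual = residual - top_value * np.outer(top_vector, top_vector)
    return np.array(values), np.column_stack(vectors), residual

# Hypothetical covariance matrix used only for illustration.
rng = np.random.default_rng(4)
G = rng.standard_normal((5, 5))
sigma = G @ G.T

values, vectors, residual = sequential_components(sigma, steps=3)
top = np.sort(np.linalg.eigvalsh(sigma))[::-1]     # eigenvalues of sigma, descending
```

In this sketch `values` should match the three largest eigenvalues in `top`, the columns of `vectors` should be orthonormal, and the final `residual` should carry only the remaining eigenvalues of $\Sigma$.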

Remark. The procedure was only carried through $r$ stages, where $r$ = rank of $\Sigma$, instead
of $p$ stages. The reason for this is that $X = P_1P_1'X + \cdots + P_rP_r'X$ (a.s.), so that
any variable uncorrelated with $P_1P_1'X, \ldots, P_rP_r'X$ is uncorrelated with $X$ and then

\[ E(X - AY)(X - AY)' = \Sigma + A\,E(Y^2)\,A' \ge \Sigma. \]

If one insists on introducing $p$ components, then since (a.s.)
$P_{r+1}'X = \cdots = P_p'X = 0$, $P_{r+1}P_{r+1}'X, \ldots, P_pP_p'X$ may be taken as the extra components.
However, no matter how the extra components are introduced, stages $r + 1, \ldots, p$ are irrelevant.

Remark. An alternative way of sequentially introducing the first $r$ (= rank of $\Sigma$)
principal components is to begin as above with $A_1Y_1 = P_1P_1'X$, having the property that
the eigenvalues of its residual covariance matrix are less than or equal to the corresponding
eigenvalues of any possible residual covariance matrix. However, as a second step, we consider
the residual $X - P_1P_1'X = Z$, which itself has zero expectation and covariance matrix
$\Sigma - \lambda_1P_1P_1' = \sum_{i=2}^{p} \lambda_iP_iP_i'$. Applying Theorem 3.1 to the residual $Z$, one finds that
$AY = P_2P_2'X$ gives a corresponding residual covariance matrix

\[ E(Z - P_2P_2'X)(Z - P_2P_2'X)' = \Sigma - \lambda_1P_1P_1' - \lambda_2P_2P_2' \]

such that, for $i = 1, \ldots, p$, its $i$th largest eigenvalue is less than or equal to the
corresponding eigenvalue of any other possible residual covariance matrix. In this way,
the first $r$ principal components may be introduced sequentially in terms of approximating
successive residuals. We then have

Corollary 4.2. Suppose we select the approximators one at a time from the residual
covariance of the previous stage. Let $Z_q = X - \sum_{i=1}^{q} P_iP_i'X$ be the residual after
$q$ steps. Then the choice $AY = P_{q+1}P_{q+1}'X$, for stage $q + 1$, minimizes the eigenvalues
of $E(Z_q - AY)(Z_q - AY)'$.

5. Some Sample Interpretations

5.1. p-Space Interpretations

Let  (x^,...,x^)  = X denote  n observations  on  the  p x i random  vector  X.  We 
can  think  of  the  n observations  as  n points  in  R^.  Moreover,  one  can  assign 
probability  1/n  to  each  observation  vector  and  apply  the  results  from  the  population 
model  to  obtain  the  sample  results.  We  intend,  however,  to  go  in  the  reverse  direction 
ai.d  show  how  Pearson's  (1901)  result  may  be  extended  to  hold  for  all  increasing  functions 
of  eigenv.,iues.  This  frames  the  result  by  Okamoto  and  Kanazawa  so  that  it  may  be  viewed 
as  a natural  population  analog.  In  our  development,  although  we  speak  of  projections  on 
planes  not  passing  through  the  origin  we  really  make  suitable  translations  and  take 
proper  projections  on  subspaces. 

Suppose  that  we  are  interested  in  finding  that  line  which  best  fits  the  n given 

points,  in  the  sense  that  the  sum  of  the  squares  of  the  distances  from  these  n points 

to  the  line  is  a minimum.  More  generally,  we  ask  for  the  hyperplane  of  dimension  k p) 

which  best  fits  the  data  in  the  above  sum  of  squares  sense. 

_ ^ n n 

Let  X = — ^ X.,  be  the  centroid  of  the  n points  and  S = ^ (x.  - x)  (x . - x) ' 

" i=l  i=l  - '1 

the  cross  products  matrix.  We  have  the  following  theorem  due  to  Pearson  (1901) . 

Theorem  5.1  (Pearson).  The  hyperplane  of  dimension  k(^  p)  such  that  the  sum  of 

squares  of  the  distances  from  the  points  to  the  plane  is  a minimum  is  of  the  form 

{x|x  = X + Py,);  e r’^} 

where  P = (P^^ , . . . ,Pj^)  and  its  k columns  constitute  a set  of  k orthonormal  eigen- 
vectors of  the  cross  products  matrix  S corresponding  respectively  to  the  k largest 
eigenvalues. 

If  X + PUj^  is  the  point  in  the  plane  closest  to  x^  then 


PU^  = PP' (x^  - X) 


?l?i<?i  - ?> 


P.P/(x.  - X) 
-k'k  -1 


5.1) 


and  if  i i are  the  eigenvalues  of  S 

the  points  x^  to  the  hyperplane  equals 

SS (distances)  > X,  , 
k+1 


then  the  sum  of  squared  distances  from 

+ •••  + X . (5.2) 

P 
Pearson's result then tells us that the "best" hyperplane is determined by $P_1,\ldots,P_k$ and the point nearest to $x_i$ is $\bar{x} + P_1P_1'(x_i - \bar{x}) + \cdots + P_kP_k'(x_i - \bar{x})$.

We may also think of $P_i$ as determining a privileged direction in $R^p$. The line $\{P_i a \mid a \in R\}$ is the line perpendicular to $P_1,\ldots,P_{i-1}$ which best fits the points $(x_1 - \bar{x},\ldots,x_n - \bar{x}) = X^*$ or equivalently the line that best fits (no restrictions now) the residuals $X^* - P_1P_1'X^* - \cdots - P_{i-1}P_{i-1}'X^*$. Pearson's result can be extended to a statement about eigenvalues.

Let $P$ be a $p \times k$ matrix, $\mu = (\mu_1,\ldots,\mu_n)$ be a $k \times n$ matrix and $x$ a $p \times 1$ vector, then the points $x + P\mu_i$ $(i = 1,\ldots,n)$ all lie in the hyperplane parallel to that spanned by the columns of $P$. Consider the matrix

$$\sum_{i=1}^{n} (x_i - x - P\mu_i)(x_i - x - P\mu_i)'. \tag{5.3}$$


If $x + P\mu_i$ is the projection of $x_i$ over the hyperplane then the trace of (5.3) gives us the sum of squared distances from the $x_i$ to the hyperplane and Pearson's result tells us how to select $x$ and $P$. Here we show how to make the eigenvalues of (5.3) as small as possible by a correct choice of $x, P, \mu_1,\ldots,\mu_n$. It is clear that the columns of $P$ can be taken to be orthonormal in the following optimization.

Theorem 5.2. To minimize the eigenvalues of the matrix

$$\sum_{i=1}^{n} (x_i - x - P\mu_i)(x_i - x - P\mu_i)' \tag{5.4}$$

$x$ may be selected as $\bar{x}$, and

$$P(\mu_1,\ldots,\mu_n) = PP'X^*$$

where $X^*$ stands for the matrix $X$ with each of its rows corrected for the mean and $P_1,\ldots,P_k$ constitute a set of $k$ orthonormal eigenvectors of the cross products matrix $S$ corresponding respectively to the $k$ largest eigenvalues $\lambda_1 \ge \cdots \ge \lambda_k$. Thus, any increasing function of the eigenvalues is minimized by this choice.

Proof. Without loss of generality, we may assume that $P\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n} P\mu_i = 0$ for, if this is not the case, $x$ may be replaced by $x + P\bar{\mu}$ and $P\mu_i$ by $P\mu_i - P\bar{\mu}$. Suppose then, $P\bar{\mu} = 0$. We have the non-negative definite ordering

$$\begin{aligned}
\sum_{i=1}^{n} (x_i - x - P\mu_i)(x_i - x - P\mu_i)'
&= \sum_{i=1}^{n} (x_i - \bar{x} - P\mu_i)(x_i - \bar{x} - P\mu_i)' + n(\bar{x} - x)(\bar{x} - x)' \\
&\ge \sum_{i=1}^{n} (x_i - \bar{x} - P\mu_i)(x_i - \bar{x} - P\mu_i)'
\end{aligned} \tag{5.5}$$

with strict inequality unless $x = \bar{x}$ so that $x$ must be taken as the centroid of $x_1,\ldots,x_n$.

Next we write $Y = (y_1,\ldots,y_n) = (x_1 - \bar{x},\ldots,x_n - \bar{x})$ and obtain

$$\sum_{i=1}^{n} (y_i - P\mu_i)(y_i - P\mu_i)' = (Y - P\mu)(Y - P\mu)', \qquad \mu = (\mu_1 : \cdots : \mu_n).$$

The above matrix has the same nonzero eigenvalues as the matrix

$$(Y - P\mu)'(Y - P\mu)$$

and we have

$$\begin{aligned}
(Y - P\mu)'(Y - P\mu) &= (Y - PP'Y + PP'Y - P\mu)'(Y - PP'Y + PP'Y - P\mu) \\
&= (Y - PP'Y)'(Y - PP'Y) + (PP'Y - P\mu)'(PP'Y - P\mu) \\
&\ge (Y - PP'Y)'(Y - PP'Y)
\end{aligned}$$

with strict inequality unless $PP'Y = P\mu$ or $P'Y = \mu$. With this choice we have

$$(Y - PP'Y)'(Y - PP'Y) = Y'(I - PP')Y.$$

We want to minimize the eigenvalues of this matrix or equivalently of the matrix

$$(I - PP')YY'(I - PP') = (I - PP')S(I - PP') = QQ'SQQ'$$

where $Q$ is such that $(P : Q)$ is orthogonal. Next, consider the product

$$\begin{pmatrix} P' \\ Q' \end{pmatrix} QQ'SQQ' \,(P \;\; Q) = \begin{pmatrix} 0 & 0 \\ 0 & Q'SQ \end{pmatrix}$$

which has the same eigenvalues as $QQ'SQQ'$. Its eigenvalues are $k$ zeroes and those of the lower right corner of

$$\begin{pmatrix} P' \\ Q' \end{pmatrix} S \,(P \;\; Q) = \begin{pmatrix} P'SP & P'SQ \\ Q'SP & Q'SQ \end{pmatrix}.$$

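The Poincaré separation theorem (c.f. Bellman (1970), p. 117) shows that

$$\lambda_1[Q'SQ] \ge \lambda_{k+1}[S],\ \ldots,\ \lambda_{p-k}[Q'SQ] \ge \lambda_p[S]$$

with at least one strict inequality unless an orthogonal matrix of the form $\begin{pmatrix} R_1 & 0 \\ 0 & R_2 \end{pmatrix}$ with $R_1$ of order $k$ exists, such that

$$\begin{pmatrix} R_1' & 0 \\ 0 & R_2' \end{pmatrix} \begin{pmatrix} P' \\ Q' \end{pmatrix} S \,(P \;\; Q) \begin{pmatrix} R_1 & 0 \\ 0 & R_2 \end{pmatrix} = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)$$

where $\lambda_1 \ge \cdots \ge \lambda_p$ are the eigenvalues of $S$.

Then the columns of $PR_1$ must constitute a set of $k$ orthonormal eigenvectors of $S$ corresponding respectively to the $k$ largest eigenvalues $\lambda_1 \ge \cdots \ge \lambda_k$. Finally $PP' = PR_1R_1'P'$ gives the desired result. □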
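The simultaneous minimization in Theorem 5.2 can be checked numerically. The sketch below is a hedged illustration only, again assuming $p = 2$ and $k = 1$ so that closed-form eigenvalues suffice. The data, the random alternatives, and the helper names (`eig_sym_2x2`, `residual_matrix`) are hypothetical. It compares the ordered eigenvalues of the matrix (5.4) at the choice given by the theorem with those at randomly perturbed choices of $x$, $P$ and $\mu_1,\ldots,\mu_n$.

```python
import math
import random

def eig_sym_2x2(m):
    """Eigenvalues (largest first) and a unit eigenvector for the largest one."""
    (a, b), (_, d) = m
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if abs(b) > 1e-12:
        v = (b, lam1 - a)  # solves (m - lam1 * I) v = 0 when b != 0
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam1, lam2, (v[0] / norm, v[1] / norm)

def residual_matrix(points, x, P, mus):
    """Matrix (5.4): sum over i of (x_i - x - P mu_i)(x_i - x - P mu_i)'."""
    m11 = m12 = m22 = 0.0
    for (p0, p1), mu in zip(points, mus):
        r0 = p0 - x[0] - P[0] * mu
        r1 = p1 - x[1] - P[1] * mu
        m11 += r0 * r0
        m12 += r0 * r1
        m22 += r1 * r1
    return ((m11, m12), (m12, m22))

# Hypothetical sample of n = 5 points in the plane.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
n = len(points)
xbar = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# With all mu_i = 0 and x = xbar, (5.4) is the cross products matrix S itself.
S = residual_matrix(points, xbar, (1.0, 0.0), [0.0] * n)
s_lam1, s_lam2, P1 = eig_sym_2x2(S)

# Choice given by Theorem 5.2: x = xbar and mu_i = P1'(x_i - xbar).
mus_opt = [P1[0] * (p[0] - xbar[0]) + P1[1] * (p[1] - xbar[1]) for p in points]
opt1, opt2, _vec = eig_sym_2x2(residual_matrix(points, xbar, P1, mus_opt))

# Compare with perturbed choices; both ordered eigenvalues should be no smaller.
rng = random.Random(0)
worst_gap = float("inf")
for trial in range(500):
    theta = rng.uniform(0.0, math.pi)
    P = (math.cos(theta), math.sin(theta))
    x = (xbar[0] + rng.gauss(0.0, 1.0), xbar[1] + rng.gauss(0.0, 1.0))
    mus = [P[0] * (p[0] - x[0]) + P[1] * (p[1] - x[1]) + rng.gauss(0.0, 0.1)
           for p in points]
    e1, e2, _vec = eig_sym_2x2(residual_matrix(points, x, P, mus))
    worst_gap = min(worst_gap, e1 - opt1, e2 - opt2)

print(opt1, opt2, worst_gap)
```

Here `opt1` coincides with the smaller eigenvalue of $S$ and `opt2` is zero up to rounding, matching the "$k$ zeroes and those of $Q'SQ$" description in the proof. A random search of this kind only illustrates the theorem and does not replace the argument.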

We  now  state  the  sample  version  of  the  result  by  Okamoto  and  Kanazawa  (1960)  or 

Okamoto  (1969).  Our  intention  is  to  illustrate  how  this  is  connected  to  Pearson  (1901). 

Theorem 5.3. Let $X$ denote the random variable which assumes each of the observed sample values $x_i$, $i = 1,2,\ldots,n$ with probability $\frac{1}{n}$ and $U$ denote any corresponding $k \times 1$ vector. Then, the eigenvalues of

$$E[X - \bar{x} - AU][X - \bar{x} - AU]' = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x} - Au_i)(x_i - \bar{x} - Au_i)'$$

are simultaneously minimized over all $A$ of dimension $p \times k$ and all sets of values $\{u_i\}$ for $U$ by

$$Au_i = (P_1P_1' + \cdots + P_kP_k')(x_i - \bar{x})$$

where $P_1,\ldots,P_k$ is a set of orthonormal eigenvectors of the matrix $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ corresponding to the $k$ largest eigenvalues.

Proof. The result follows directly from Theorem 5.2 or as a special case of Theorem 3.1. □

We  also  have  another  geometric  interpretation  in  terms  of  approximating  cross 
product  matrices. 

Let a hyperplane be $\{v \mid v = \bar{x} + P\mu;\ \mu \in R^k\}$ where $P = (P_1,\ldots,P_k)$ is $p \times k$ with orthonormal columns. The projection of a given point $x_i$ into this hyperplane is the point $v_i = \bar{x} + PP'(x_i - \bar{x})$. Let $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ denote the cross products matrix and let

$$S_v = \sum_{i=1}^{n} (v_i - \bar{v})(v_i - \bar{v})' = \sum_{i=1}^{n} PP'(x_i - \bar{x})(x_i - \bar{x})'PP' = PP'SPP' \tag{5.8}$$

denote the cross products matrix of the projected points. We want to find a hyperplane in $R^p$, passing through the centroid $\bar{x}$, and such that $S_v$ is close to $S$.

Theorem 5.4. In order to minimize the eigenvalues of the discrepancy in sums of cross product matrices

$$S - S_v = S - PP'SPP',$$

over all $k$ dimensional planes passing through $\bar{x}$, we must take the plane where the columns of $P$ are a set of eigenvectors of $S$ corresponding to the $k$ largest eigenvalues.
Proof. Let $Q$ be such that $(P : Q)$ is orthogonal. The eigenvalues of $S - PP'SPP'$ are exactly the same as those of

$$\begin{pmatrix} P' \\ Q' \end{pmatrix} S \,(P \;\; Q) - \begin{pmatrix} P' \\ Q' \end{pmatrix} PP'SPP' \,(P \;\; Q) = \begin{pmatrix} 0 & P'SQ \\ Q'SP & Q'SQ \end{pmatrix}$$

whose eigenvalues are $k$ zeroes together with those of $Q'SQ$.

From the argument given after equation (5.7), it follows that the hyperplane has to be of the form

$$\{v \mid v = \bar{x} + P\mu;\ \mu \in R^k\} \tag{5.9}$$

where the columns of $P = (P_1,\ldots,P_k)$ constitute a set of orthonormal eigenvectors of the matrix $S$ corresponding respectively to the $k$ largest eigenvalues $\lambda_1 \ge \cdots \ge \lambda_k$.


Remark. If, alternatively, we consider the residuals $x_i - v_i$ with $v_i$ as in (5.8), then the cross products matrix for the residuals is

$$S_r = \sum_{i=1}^{n} (x_i - v_i)(x_i - v_i)' = \sum_{i=1}^{n} (x_i - \bar{x} - PP'(x_i - \bar{x}))(x_i - \bar{x} - PP'(x_i - \bar{x}))'$$

and Theorem 5.2 tells us that, to minimize the eigenvalues of $S_r$, the hyperplane must also be taken as in (5.9).

5.5.  n-Space  Interpretation 

A geometrical interpretation of principal components in $n$ space does not seem to have been given by previous workers. Let $X = (x_1,\ldots,x_n)$ denote $n$ observations on the $p$-vector $X$. If the $i$-th row of $X = (x_1,\ldots,x_n)$ is denoted by $y_i'$ and the $i$-th row of $X^* = (x_1 - \bar{x},\ldots,x_n - \bar{x})$ by $y_i^{*\prime}$, then these rows are points in $R^n$.
When working with the mean corrected rows of the data matrix, it is natural to apply Euclidean distances and inner products. Suppose then, we ask for the plane of dimension $k\ (\le p)$ which best fits the $p$ points determined by the rows of $X^*$ and which passes through the origin. If $C = (c_1,\ldots,c_n)$, with rows $r_1',\ldots,r_p'$, denotes the points in the hyperplane with $r_i'$ being the point closest to $y_i^{*\prime}$, we want to minimize the sum of the squared norms of the rows of the matrix

$$(X^* - C) = (x_1^* - c_1,\ldots,x_n^* - c_n) \tag{5.10}$$

which is clearly equivalent to minimizing the sum of squares of all the entries of the above matrix or to minimizing the sum of the squared norms of the columns. From Pearson's result, the best choice of $C$ is then

$$C = P_1P_1'X^* + \cdots + P_kP_k'X^* \tag{5.11}$$

where $P_1,\ldots,P_k$ are as in (5.1).


For the case $k = 1$, we can consider the problem as one of first projecting the rows of $X^*$ on any $n \times 1$ vector $a$. These projections are given by the rows of

$$X^*aa'/(a'a) = C_a \tag{5.12}$$

where $C_a$ is of rank 1 when $\mathrm{rank}(X^*) \ge 1$. Thus, the sum of squares of the $i$-th row of $X^* - C_a$ is the squared distance from $y_i^*$ to the line determined by $a$. Moreover, we know that the choice $a' = P_1'X^*$ = sample values of the first principal component, produces $C_a$ of the form (5.12) or

$$X^*X^{*\prime}P_1P_1'X^* / P_1'X^*X^{*\prime}P_1 = SP_1P_1'X^* / P_1'SP_1 = P_1P_1'X^*$$

conforming to the optimal choice of $C$ given by (5.11) with $k = 1$.
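The $k = 1$ identity above can be verified on a small example. The following is a minimal sketch, assuming $p = 2$ variables and $n = 5$ observations. The data matrix and helper names (`dot`, `eig_sym_2x2`) are hypothetical. It forms $a' = P_1'X^*$, projects the rows of $X^*$ on $a$ as in (5.12), and compares the result with $P_1P_1'X^*$ from (5.11).

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def eig_sym_2x2(m):
    """Eigenvalues (largest first) and a unit eigenvector for the largest one."""
    (a, b), (_, d) = m
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if abs(b) > 1e-12:
        v = (b, lam1 - a)  # solves (m - lam1 * I) v = 0 when b != 0
    else:
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam1, lam2, (v[0] / norm, v[1] / norm)

# Hypothetical p x n data matrix: each row is one variable, each column one observation.
X = [[0.0, 1.0, 2.0, 3.0, 4.0],
     [0.1, 0.9, 2.2, 2.8, 4.1]]
p, n = len(X), len(X[0])

# Mean corrected rows y_i*: points in n space.
X_star = [[value - sum(row) / n for value in row] for row in X]

# Cross products matrix S = X* X*' (2 x 2 here).
S = ((dot(X_star[0], X_star[0]), dot(X_star[0], X_star[1])),
     (dot(X_star[1], X_star[0]), dot(X_star[1], X_star[1])))
lam1, lam2, P1 = eig_sym_2x2(S)

# a' = P1' X*: sample values of the first principal component.
a = [P1[0] * X_star[0][j] + P1[1] * X_star[1][j] for j in range(n)]
aa = dot(a, a)

# (5.12): rows of X* projected on the line determined by a.
C_a = [[dot(row, a) / aa * aj for aj in a] for row in X_star]
# (5.11) with k = 1: P1 P1' X*, whose i-th row is P1[i] * a'.
C_opt = [[P1[i] * aj for aj in a] for i in range(p)]

max_diff = max(abs(C_a[i][j] - C_opt[i][j]) for i in range(p) for j in range(n))
sq_dist = sum((X_star[i][j] - C_a[i][j]) ** 2 for i in range(p) for j in range(n))
print(max_diff, sq_dist, lam2)
```

In this sketch `max_diff` is zero up to rounding, `aa` equals $P_1'SP_1 = \lambda_1$, and the total squared distance from the rows $y_i^*$ to the line equals the remaining eigenvalue $\lambda_2$.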
Remark. The first principal component minimizes the sum of squared distances to the

rows  of  the  mean  corrected  data  matrix  X*.  This  is  illustrated  in  Figure  5.1.  Our 

geometrical  interpretation  helps  us  understand  the  relative  importance  of  a row  with 

large  length  (sample  variance)  in  determining  the  principal  component. 

Remark. If the rows of $X^*$ are first scaled to become unit vectors, then extracting principal components from the new matrix is equivalent to using the sample correlation matrix $R$. As the vectors in the geometric interpretation are now all of unit length, minimizing the sum of squared errors is equivalent to maximizing (minimizing) the sum of $\cos^2\theta_i$ ($\sin^2\theta_i$) where $\theta_i$ is the angle between $y_i^*$ and $a$.

It is easy to see that $P_i'X^*$ determines a privileged line in $n$ space. It is the line perpendicular to $P_1'X^*,\ldots,P_{i-1}'X^*$ that passes through the origin and which best fits the rows of $X^*$ or equivalently the line through the origin which best fits the rows (no restrictions now) of the residual $X^* - P_1P_1'X^* - \cdots - P_{i-1}P_{i-1}'X^*$.


Figure 5.1. Case $p = 3$, showing sample first principal component $P_1'X^*$ minimizing the sum of squared distances $\|y_1^* - r_1\|^2 + \|y_2^* - r_2\|^2 + \|y_3^* - r_3\|^2$ from the $y_i^*$'s to the line.